National Aeronautics and Space Administration
Lewis Research Center 
Cleveland, Ohio 44135 


SUMMARY 

A technique allowing time-staggered solution of partial differential
equations is presented in this report. Using this technique, called
time-partitioning, simulation execution speedup is, in theory, proportional
to the number of processors used because all processors operate
simultaneously, with each updating the solution grid at a different time
point. The technique is limited neither by the number of processors available
nor by the dimension of the solution grid. Time-partitioning was used to
obtain the flow pattern through a cascade of airfoils, modeled by the Euler
partial differential equations. An execution speedup factor of 1.77 was
achieved using a two-processor Cray X-MP/24 computer.


INTRODUCTION 

The trend in aeropropulsion system designs has been to try to obtain more
and more power with less and less weight. To achieve this increase in
performance, simplicity is generally sacrificed. Aeropropulsion systems, and
their components, become more complex with each new design.

Computational Fluid Dynamics (CFD) is playing an increasingly important
role in the design of aeropropulsion systems. This is due to: (1) the high
cost of building hardware; (2) the time and expense required to conduct wind
tunnel tests of new designs; (3) the lack of facilities to realistically test
new designs (testing the National Aerospace Plane concepts at hypersonic
speeds, for instance); (4) advances in computational technology; and
(5) increased understanding of fundamental physics.

The objective of CFD is to build an understanding of these advanced
systems into mathematical models which accurately represent the complicated
physics taking place in these systems. The good news is that these
mathematical models are evolving through intensive research (both
experimental and analytical), but the bad news is that the models are so
detailed that time-accurate solutions cannot currently be obtained in a
reasonable amount of time. Holst (ref. 1) projects that direct Navier-Stokes
simulation of an airfoil using no simplifying assumptions would require
approximately 10^16 computer operations. This amounts to about 4 months of
cpu time using a state-of-the-art gigaflop computer such as the Cray-2. If
solutions were available in minutes or hours, optimized designs could be
generated on the computer. Therefore, orders of magnitude increases in
computing speed are needed to make CFD practical for aeropropulsion design
optimization.
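The run time quoted above can be checked with a few lines of arithmetic. The
sketch below assumes a sustained rate of exactly one gigaflop (10^9
operations per second) and a 30-day month; both are simplifying assumptions
made here for illustration, not figures taken from reference 1.

```python
# Back-of-the-envelope check of the cpu time implied by 10^16 operations.
OPERATIONS = 1e16      # operation count attributed to Holst (ref. 1)
RATE = 1e9             # assumed sustained rate: 1 gigaflop = 1e9 operations/sec

seconds = OPERATIONS / RATE
days = seconds / 86_400        # 86,400 seconds per day
months = days / 30             # assumed 30-day month

print(f"{seconds:.0e} s = {days:.0f} days = {months:.1f} months")
# prints: 1e+07 s = 116 days = 3.9 months
```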

It is generally recognized that computers are fast approaching speed
limits. As a result, the 1980's have seen a growing interest in combining
state-of-the-art hardware with new architectures and software techniques to
try to achieve the required speedup. Approaches have included vector
processing (single instruction-multiple data), multiprocessing (multiple
instruction-multiple data), and data-driven architectures. Williams and
Bobrowicz (ref. 2) indicate that speedup factors of ten or more can be
attained by combining vector processing with multiprocessing.

Today, almost all supercomputers use vector processing and several (e.g.,
Cyber 205, Cray X-MP, Cray-2) use multiple vector processors. Programming
these supercomputers involves the use of vectorizing compilers that convert
source codes, originally intended for conventional single, scalar processor
computers, to codes that run efficiently on the vector processors. While
today's supercomputers represent a step in the right direction, they still
offer only a fraction of the needed computing power because of the limited
number of processors (four or fewer) and the limited capabilities of the
software.

Data-driven approaches to parallel processing have been proposed (refs. 3
and 4) that involve large numbers (hundreds or thousands) of processors. In
these cases, calculations are assigned to processors on a single operation
basis. Hundreds of processors could be used to achieve significant speedups
in simulations where like numbers of individual operations can be
simultaneously carried out. However, software is required to control these
calculations and to assign the operations to the processors. The sequencing
of hundreds of computers is a tremendous software task.

It seems clear that tapping the tremendous potential of parallel
processing will depend upon advancements in software technology. In
particular, software needs to be developed which can automatically map
complex, multidimensioned codes onto parallel architectures, making effective
use of available scalar and vector processing resources.

Researchers at NASA Lewis Research Center are actively engaged in a
research program (refs. 5 to 14) to explore parallel processing techniques for
analyzing internal flows in aeropropulsion systems. One of the objectives of
that research is to identify parallel architectures and algorithms that are
well suited for three-dimensional Navier-Stokes flow solvers. Another
objective is to devise techniques for effectively partitioning the solver
calculations for parallel solution.

This paper discusses a partitioning technique which allows calculations
at the next time interval to begin before all calculations at the current time
interval are completed. The authors refer to the method as time-partitioning.
The next section of this report discusses time-partitioning and other
partitioning methods, pointing out the advantages and disadvantages of each.
Time-partitioning is then applied to a restriction in a flow field problem.
This example represents an important class of problems relating to computing
flows in turbomachinery cascades. The resulting speedup, obtained using the
Cray X-MP, is discussed, as well as the steps required to develop and
implement the time-partitioned simulation.


PARTITIONING METHODS 

Partitioning the simulation into work units and allocating those work
units to processors (packing) is one of the most difficult tasks which must be
addressed in parallel processing. The way that the simulation is partitioned
and packed directly affects the speed at which the simulation executes and the
efficiency of processor use. Work loads should be balanced among the
processors to eliminate excess processor idle time. The level of parallelism
being considered greatly affects the ease with which processor work load
balance is achieved, as shown in table I. This table summarizes key
characteristics of the three partitioning methods about to be discussed.

Data-driven architectures consider parallelism in a simulation at its
most basic operational level. An operation is considered the basic unit of
work. When an operation is triggered, a processor is assigned by the system
to carry out that operation. When that single calculation is completed, the
processor is free to be assigned to another waiting calculation. In this case,
then, processor load is very simply a single operation. Processor assignment,
on the other hand, is very difficult. In a large simulation, literally
hundreds of additions and multiplications may be ready to be carried out
simultaneously. Processors to service them are normally assigned on the fly
while the simulation is executing. The bookkeeping for tracking which
processors are currently busy and which are available for assignment is
tremendous. Sophisticated software is required to manage this task.

Assigning the equation as the basic unit of work eliminates the
requirement of having to assign processors on the fly to carry out parallel
calculations. Equations are assigned for computation to the processors before
execution begins. Relatively few processors are required to execute a
simulation. (The helicopter engine simulation of reference 10 required only
six to achieve minimum execution time.) However, using the equation as a basic
unit of work makes this architecture one level removed from the data-driven
architectures. Whereas before, work balance on the processors was no
consideration, now it is an important consideration. Depending on the
complexity of the equation, the time to calculate the output of an equation
will vary. Hence, work load balance among the processors cannot be achieved by
just assigning an equal number of equations to each processor. Equation
execution times must be determined and sequential calculation paths must be
identified. The longest such path is designated the critical path because its
execution time is the minimum possible execution time of the simulation. To
achieve this minimum execution time the critical path equations must reside on
a processor by themselves. The other paths must be packed on remaining
available processors in such a way that the execution time of no processor
exceeds that of the critical path processor. The entire process of
partitioning and packing mathematical models for parallel calculation has been
automated. Reference 15 discusses the procedure in detail; block diagrams of
the process are included in the report.
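The critical-path idea can be illustrated with a small sketch. This is not
the automated procedure of reference 15; the equation names, execution
times, and dependencies below are hypothetical values chosen only to show
how the longest sequential path bounds the execution time.

```python
from functools import lru_cache

# Hypothetical equation execution times (arbitrary units) and dependencies.
exec_time = {"e1": 3, "e2": 2, "e3": 4, "e4": 1, "e5": 2}
depends_on = {"e1": [], "e2": ["e1"], "e3": ["e1"], "e4": ["e2"], "e5": ["e3", "e4"]}


@lru_cache(maxsize=None)
def finish_time(eq):
    """Earliest finish time of eq when every prerequisite starts as soon as it can."""
    start = max((finish_time(d) for d in depends_on[eq]), default=0)
    return start + exec_time[eq]


def critical_path():
    """Walk back from the latest-finishing equation along its slowest prerequisite."""
    eq = max(exec_time, key=finish_time)
    path = [eq]
    while depends_on[eq]:
        eq = max(depends_on[eq], key=finish_time)
        path.append(eq)
    return path[::-1]


path = critical_path()
critical_time = sum(exec_time[e] for e in path)   # minimum possible execution time
total_work = sum(exec_time.values())              # serial execution time
print(path, critical_time, total_work)
```

In this made-up example the critical path is e1, e3, e5 with an execution
time of 9 units against 12 units of total work, so the remaining equations
(e2 and e4, 3 units) fit on one additional processor without exceeding the
critical path time.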

Each of the partitioning methods discussed above performs all calculations
within the same time interval. Time-partitioning, as the name implies,
performs calculations at different time intervals simultaneously. This
partitioning technique is particularly suited for problems requiring solution
of partial differential equations over a grid.

Whereas other partitioning methods identify vectorization as scalar
parallelism, time-partitioning maintains vectorization within the grid
calculations.

For simplicity, the discussion here will assume a two-dimensional grid.
The assumption is made for convenience only in describing the
time-partitioning concept and does not imply limitations on the generality of
the method. The concept readily extends to n-dimensional grids. However,
time-partitioning does require that current state variable values be dependent
only on neighboring-node past values. This condition would be met if an
explicit integration method were used, for example.

Time-partitioning can be used to reduce the effective calculation time of
the simulation if parameter update calculations over the grid are completed in
the systematic fashion described in the following paragraph. At some point,
before all grid node updates for the current time interval have been completed,
sufficient information will be available to begin updating grid nodes at the
next time interval.

Suppose that the current parameter values at each node of a rectangular
grid are dependent only on past values at two columns of neighboring nodes.
If the grid nodes are updated columnwise from left to right, once three
columns of nodes have been updated, sufficient information exists to begin
updating the leftmost node columns at the second time interval. Since the same
kinds of calculations are taking place at each node, calculation time at each
node is comparable. Theoretically, then, the first processor set should remain
a fixed distance (that is, three columns of nodes) ahead of the second
processor set. Hence, there should be no delays caused by the second processor
set having to wait for required information from the first.

Likewise, once the second processor set has updated three columns of
nodes, sufficient information again exists to begin updating the simulation at
the third time interval. This process can continue until all processor sets
available are being used or until the first processor set has completed its
time interval update. In either case, the first processor set will update the
next time interval. The process continues to repeat until the simulation run
has been completed. A diagram of the time-partitioning execution process is
shown in figure 1.
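The staggered start described above can be expressed as a simple timing
model. The sketch below is an idealization introduced here for illustration:
it assumes every column update takes exactly one "tick", ignores all
multitasking overhead, and uses the hypothetical function names start_ticks
and ticks_needed. A time interval may begin once the previous interval is
`lag` columns ahead and the processor set assigned to it has finished its
previous sweep.

```python
def start_ticks(n_cols, n_sets, n_intervals, lag=3):
    """Tick at which each time interval begins its left-to-right sweep."""
    start = []
    for t in range(n_intervals):
        earliest = 0
        if t >= 1:
            # Must trail the previous time interval by `lag` columns.
            earliest = max(earliest, start[t - 1] + lag)
        if t >= n_sets:
            # The same processor set must first finish its previous sweep.
            earliest = max(earliest, start[t - n_sets] + n_cols)
        start.append(earliest)
    return start


def ticks_needed(n_cols, n_sets, n_intervals, lag=3):
    """Total ticks until the last time interval finishes its sweep."""
    return start_ticks(n_cols, n_sets, n_intervals, lag)[-1] + n_cols


# 33 columns and 1870 time intervals, as in the example problem of this report.
serial = ticks_needed(33, 1, 1870)
two_sets = ticks_needed(33, 2, 1870)
eleven_sets = ticks_needed(33, 11, 1870)
print(serial / two_sets, serial / eleven_sets)
```

Under these idealized assumptions the model gives speedup ratios of roughly
2.0 for two processor sets and roughly 10.9 for eleven, the shortfall from
the processor count being the time needed to start the pipeline.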

An outstanding feature of time-partitioning is the ease with which the
technique can be implemented. Basically the same equations are executing on
each processor, but at different time points. Because of this, the processor
work load is almost naturally balanced. The process can be implemented with
as few as two processors, and the theoretical speedup factor realized is
proportional to the number of processors used. Processor idle time is
virtually nil.

To use time-partitioning techniques requires that parameters at a node
can be updated using only past values of parameters at some level of
neighboring nodes. Thus, a simulation using an implicit integration method
could not be time-partitioned due to the iterative nature of the solution and
the interdependence of the parameter current values. Time-partitioning
techniques were applied to a fluid-flow problem at Lewis Research Center using
the two-processor Cray X-MP computer. The example problem used and the results
obtained are discussed in the following sections.


TIME-PARTITIONED SIMULATION DEVELOPMENT

An important class of fluid flow problem deals with computing flows in
turbomachinery cascades. This flow information is vital to developing
efficient new turbomachinery designs. Calculating the flow about the cascade
of bicircular arc airfoils shown in figure 2 is representative of this class
of problems, and is well-documented by Johnson and Chima (refs. 16 to 19).

The cascade of airfoils can be used to model many different systems. For
instance, two adjacent airfoils could model the convergent-divergent nozzle of
a jet engine.


Chima and Johnson model the cascade of airfoils using the thin-layer
version of the Navier-Stokes equations. The thin-layer assumption is
implemented by using a body-fitting coordinate system and neglecting the
viscous terms in the coordinate direction along the body. Initial conditions
are specified as uniform flow at the isentropic Mach number implied by the
ratio of exit static pressure to inlet total pressure. Specified inlet
boundary conditions are total pressure, total temperature, and flow angle; at
the exit, static pressure is specified. For inviscid flow, the tangency
condition is applied along solid surfaces as shown in figure 2. Starting with
a 65 by 17 grid and using the multi-grid acceleration scheme discussed in
reference 17, Chima and Johnson achieved work reduction factors for inviscid
flow calculations ranging from 1.14 (for choked flow conditions at Mach 0.73)
to 4.02 (for low speed flow at Mach 0.2). For Mach 0.5, a work reduction
factor of 3.31 was achieved.

Time-partitioning techniques were applied to the cascade of airfoils
problem for three reasons. First, as was mentioned above, it is an important
problem in computational fluid mechanics. Results obtained are important in
designing components which are more efficient than those currently available.

A second reason is that the Chima-Johnson multi-grid simulation could be
used as a standard for verifying the results coming from the time-partitioned
model being developed. A valid time-partitioned simulation would produce
results consistent with those from the multi-grid simulation.

And finally, the multi-grid simulation could be used as a basis for
developing the time-partitioned model. Chima and Johnson use a second-order
Runge-Kutta integration update. This is an explicit integration technique
requiring only past values to update state variables. This lends itself very
nicely to time-partitioning.

This time-partitioning study was carried out using a 33 by 9 grid (fig. 3)
at Mach 0.5 conditions. Computations are made column-wise in the discussion
which follows, although the grid can actually be updated either rowwise or
columnwise. Rowwise updating vectorizes a row of length 33 as opposed to a
column of length 9; as shown later in the COMPUTATIONAL RESULTS section,
columnwise updating allows use of up to eleven processors as opposed to three.

The Chima-Johnson simulation was written with the intent of executing the
code on a single, serial processor. Because of this, considerable
reorganization of the simulation was required in order to make it conducive to
parallel computation and time-partitioning techniques.


Reorganizing the Chima-Johnson simulation to meet time-partitioning needs
required considerable care. The simulation had been designed to carry out
calculations, a portion at a time, over all nodes of the grid, one set of
calculations being completed before the next set would begin.
Time-partitioning requires that a column of nodes be updated completely before
beginning the next column of nodes. Effecting this change was not
straightforward.

In a simulation coded to execute on a single, serial processor, memory
locations designated to hold updated parameter values can be used as scratch
memory to hold intermediate values for other calculations before those
parameters are updated. However, when time-partitioning is being used, memory
cannot generally be used for dual roles. Once a column of nodes is updated,
their values must not be changed because those values are required for use
almost immediately by another processor performing calculations at the next
update interval. Calculations in the time-partitioned simulation were arranged
as a task that updated a column of nodes every time the task was called.
Successive calls to the task updated the nodes a column at a time from left to
right across the grid. Once a node column was updated, including its boundary
values, no parameter value was changed until the node column was updated again
at the next time interval.

Normally, local variables use the same memory locations throughout
execution of a simulation. However, to use both Cray X-MP processors
simultaneously requires that the program code execute in stack mode. In this
mode, local variables are not saved between subroutine calls. Every time that
the stack is accessed, different memory locations can be used for holding
values of the local variable. A variable required to maintain its value
between subroutine calls must be a global variable. Care must be taken not to
use local variables as counters, flags, or storage locations for information
needed at a subsequent time. One way of ensuring global status is to include
the variable in a COMMON statement.

The reorganized simulation was validated by executing it on a single
processor in stack mode on the Cray X-MP. A steady state solution was obtained
in 5.47 sec execution time and the results agreed with those from the
Chima-Johnson simulation. One thousand eight hundred and seventy calculation
cycles were performed and residuals were less than 4x10^-11. Residuals are a
measure of the maximum differences between successive values of the state
variables as simulation execution progresses. As the simulation approaches
steady state conditions, the residuals approach zero. Chima and Johnson use
residuals to determine when the solution has converged (ref. 17).


TIME-PARTITIONED SIMULATION EXECUTION

As discussed above, the reorganized simulation was arranged as a task
which updated a column of nodes. Successive calls to the task updated node
columns from left to right across the grid. To execute the simulation in
time-partitioned mode, a duplicate copy of the task code is required. These
copies are designated Task 1 and Task 2 to distinguish them; however, they are
identical. Each updates a column of nodes from left to right across the grid
with each successive call to that task. Processor set 1 always executes
Task 1, and Processor set 2 always executes Task 2. The time-partitioned
simulation is executed on the Cray X-MP in the following manner. Task 1 is
called three successive times without calling Task 2. This updates the first
three columns of nodes at time interval 1, providing sufficient information to
begin updating the grid at time interval 2. Hence, on the fourth call to
Task 1, and on every call thereafter, a call is also made to Task 2. Thus,
Processor set 1 is updating the grid at odd multiples of the time interval,
while Processor set 2 is simultaneously updating the grid at even multiples.
The reason Task 2 must lag Task 1 by three columns of nodes is that the
solution finite-difference scheme uses second-order central differences for
the fluxes and a fourth-difference (5 point) artificial viscosity operator for
damping.

Since each of the tasks is updating a column of nodes with each task call,
they should also complete their respective column calculations at about the
same time. To maintain control of the simulation, however, task wait
mechanisms are incorporated into the code. This ensures that, as new calls to
the two tasks are made, they begin executing simultaneously. Hence, Task 2 is
guaranteed to be lagging Task 1 by precisely three columns of nodes.

By not using a task wait mechanism, the programmer would relinquish
control of the simulation. If both processor sets were freewheeling (that is,
executing their tasks independently and as quickly as possible), Task 2 could
actually end up leading Task 1 by the end of the simulation run. For example,
a system interrupt to Processor set 1 could momentarily delay its calculations.
Task 2 would then be using data in its calculations which had not been updated
by Task 1.
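The lock-step execution of the two tasks can be mimicked with a small
threaded sketch. This is an illustrative analogue, not the Cray X-MP
multitasking code: the one-dimensional averaging stencil, the clamped
boundary treatment, the grid sizes, and all names are assumptions made here,
and a Python threading.Barrier stands in for the task wait mechanism. Each
"column" is reduced to a single value, and one row of storage is kept per
time level so that updated values are never overwritten.

```python
import threading

N_COLS, N_LEVELS, LAG = 12, 6, 3   # illustrative sizes; N_LEVELS is even here


def update_column(u, t, j):
    """Explicit update of column j at level t from columns j-2..j+2 at level t-1."""
    prev = u[t - 1]
    # Clamp indices at the grid edges (an assumed, simplistic boundary treatment).
    stencil = [prev[min(max(j + k, 0), N_COLS - 1)] for k in (-2, -1, 0, 1, 2)]
    u[t][j] = sum(stencil) / 5.0


def new_grid():
    """Allocate one row per time level; level 0 holds the initial condition."""
    u = [[None] * N_COLS for _ in range(N_LEVELS + 1)]
    u[0] = [float(j * j) for j in range(N_COLS)]   # arbitrary initial values
    return u


def run_serial():
    u = new_grid()
    for t in range(1, N_LEVELS + 1):
        for j in range(N_COLS):
            update_column(u, t, j)
    return u


def run_time_partitioned():
    u = new_grid()
    barrier = threading.Barrier(2)                 # plays the role of the task wait
    total_ticks = (N_LEVELS // 2) * N_COLS + LAG

    def task(offset, first_level):
        for tick in range(total_ticks):
            local = tick - offset                  # Task 2 starts LAG ticks late
            if local >= 0:
                sweep, j = divmod(local, N_COLS)
                t = first_level + 2 * sweep        # odd levels or even levels
                if t <= N_LEVELS:
                    update_column(u, t, j)
            barrier.wait()                         # both tasks start each call together

    threads = [threading.Thread(target=task, args=(0, 1)),     # Task 1: levels 1, 3, 5
               threading.Thread(target=task, args=(LAG, 2))]   # Task 2: levels 2, 4, 6
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return u


assert run_time_partitioned() == run_serial()
```

With the barrier in place, Task 2 always trails Task 1 by exactly three
columns and the staggered result matches the serial sweep. Without it, the
relative progress of the two threads would be left to the scheduler, which is
the loss of control described above.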


COMPUTATIONAL RESULTS 

For this initial study of time-partitioning, only Euler equations
governing inviscid flow have been considered. However, time-partitioning
techniques are also applicable to Navier-Stokes equations governing viscous
flow. Typical results obtained from the simulation are the isomachs shown in
figure 4. These results were obtained for Mach 0.5 flow conditions. The lines
of constant Mach number form a profile of steady-state Mach number within the
computational element (figure 2) for these flow conditions. The elapsed
execution time of the simulation, however, is what is important for this
report.

As shown in table II, the two-processor time-partitioned simulation
achieved steady-state conditions in 3.09 sec. A total of 1870 calculation
cycles were performed, and residuals were less than 4x10^-11. Using the
time-partitioning techniques, an effective speedup factor of 1.77
(= 5.47/3.09) was realized. This represents an efficiency of 89 percent with
respect to the theoretical speedup factor of almost two (2 minus time to start
the process). This is consistent with the Cray X-MP multitasking overhead
reported by Chen (ref. 20). The speedup factor is significant insofar as, if
more processors were available, a third task could have been set up to begin
executing the third time interval on the fourth call to Task 2, etc. Since a
new task (and time interval calculation) could begin every time three columns
of nodes were calculated, a total of eleven (that is, 33/3) processors could
have been used in the solution of this example problem, with a theoretical
speedup in solution time proportional to the number of processors used.
(Notice that if calculations were made row-wise instead of column-wise, only
three processors could have been used.)
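The figures quoted above follow directly from the measured execution times
and the grid dimensions. The short sketch below reproduces the arithmetic;
the variable names are chosen here for illustration, and efficiency is taken
as speedup divided by the number of processors.

```python
# Measured execution times (table II) and problem dimensions from this report.
serial_time = 5.47       # sec, one processor
parallel_time = 3.09     # sec, two processors, time-partitioned
n_processors = 2

speedup = serial_time / parallel_time
efficiency = speedup / n_processors

n_cols, n_rows, lag = 33, 9, 3
max_sets_columnwise = n_cols // lag   # one new time interval per three columns
max_sets_rowwise = n_rows // lag      # one new time interval per three rows

print(f"speedup = {speedup:.2f}, efficiency = {efficiency:.0%}")
print(f"usable processor sets: {max_sets_columnwise} column-wise, "
      f"{max_sets_rowwise} row-wise")
```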

More investigations of time-partitioning will be required in order to
answer questions raised by this initial study. Foremost is determining how
using additional processors affects the execution speedup factor obtained.
Living in the real world as we do, attaining the predicted theoretical
linear relationship hardly seems realistic. Using two processors, the speedup
fell about 11 percent short. It is reasonable to assume that the task wait
mechanism incorporated into the code to maintain control of the simulation
accounts for at least a part of that difference. How the task wait will
affect the execution time when three, four, eight, or more processors are used
is something that will have to be investigated. Moreover, the speedup factor
obtained using time-partitioning techniques theoretically should not depend
heavily on the code or the size of the tasks to be executed. Whether time
savings duplicate those obtained in this study when time-partitioning
techniques are applied to other codes and other applications is a question
that must be investigated. This initial study has given some encouraging
results. Time-partitioning shows potential for being a powerful parallel
processing tool. Only through further investigation will its effectiveness be
determined.


CONCLUDING REMARKS 

Parallel processing promises to be a very effective tool for reducing 
wallclock execution time for many complex simulations. 

Time-partitioning techniques discussed in this report provide a means for
solving systems of Euler and Navier-Stokes equations at several different
time-steps simultaneously. The calculations take place in a time-staggered
fashion across the solution grid.

Time-partitioning techniques were used to determine the steady state flow
pattern through a cascade of airfoils. This important computational fluid
mechanics problem is characterized by a set of Navier-Stokes partial
differential equations. Solution was over a two-dimensional grid using a
second-order Runge-Kutta integration. An execution speedup factor of 1.77 was
achieved, using the two processors of the Lewis Research Center Cray X-MP
computer. Results from this initial study are encouraging. Time-partitioning
has the potential for providing an easy means of parallelizing explicit codes
and obtaining execution speedup factors proportional to the number of
processors used.


The application of time-partitioning techniques is not limited to a
particular number of processors. All processors available can be used.

Further studies are required to investigate the relationship between the
execution speedup factor achieved and the number of processors used.
Time-partitioning should be applied to a variety of codes from different
applications. Also, a computer system having at least four processors
(preferably more) should be used for that investigation.

The authors welcome discussions of the techniques presented in the paper,
related techniques, and developments in the many other aspects of
multiprocessor simulation.
TABLE I. - CHARACTERISTICS OF SOLUTION METHODS

Solution method  Processor assignment                  Basic unit of work  Work load balance            Processor idle time  Processors required
Data driven      Difficult, assigned during execution  Single operation    Natural balance              None                 Many
Equation driven  Pre-assigned                          Single equation     Difficult; requires packing  Depends on packing   Few
Time partition   Pre-assigned                          Set of equations    Easily balanced              Virtually none       Few

TABLE II. - COMPARISON OF SIMULATION EXECUTION

Simulation type   Processors used  Calculation cycles executed  Execution time, sec  Speedup factor  Efficiency, percent
Serial            1                1870                         5.47                 1.00            100
Time-partitioned  2                1870                         3.09                 1.77            89
FIGURE 1. - TIME-PARTITIONING EXECUTION PROCESS.

[Figure 2 labels: flow direction; computational domain; total pressure,
total temperature, and flow angle specified at the inlet; static pressure
specified at the exit; tangency condition along the solid surfaces.]

FIGURE 2. - CASCADE OF BICIRCULAR ARC AIRFOILS. 
FIGURE 3. - CASCADE OF AIRFOILS SOLUTION GRID.



FIGURE 4. - ISOMACHS FOR INVISCID BICIRCULAR ARC CASCADE. MACH 0.5. 